Policy Optimization for Corrupted Markov Decision Processes

Under Review at Neural Information Processing Systems

reinforcement learning
Author

Khang Nguyen, Kenny Guo, David Yu, William Chang

Published

May 15, 2025

If you’ve been keeping up with tech, you may have heard the term policy optimization tossed around, particularly by notable companies like OpenAI. In essence, policy optimization describes a class of gradient-based reinforcement learning (RL) algorithms that dictate how an intelligent agent should learn optimal actions in an unexplored environment. It has become popular for its relatively simple implementation, computational efficiency, and remarkable empirical success (think ChatGPT post-training, robotic arms, or even beating players at Dota 2). It is also one of the subjects of our recent paper, Policy Optimization in Corrupted Markov Decision Processes, the manuscript of which (currently under review at Neural Information Processing Systems) can be found by clicking the link. This work was done in collaboration with Khang (Steve) Nguyen, David Yu, and William Chang.

What is a “Markov Decision Process”?

When we talk about reinforcement learning, we first need a model for our learning environment. The canonical RL model is called a Markov Decision Process, or MDP. If you’re familiar with the multi-armed bandit model (a variation of which we studied in this paper), an MDP can be thought of as a generalization of bandits. The main distinction is that we now have a set of states, \(\mathcal{S}\) (assumed finite), each with a selection of actions from a total action set \(\mathcal{A}\) (again, assumed finite). Furthermore, these states are partitioned into \(H+1\) layers, \(S_0, S_1, \ldots, S_H\).

Paralleling bandits, the MDP proceeds over \(T\) episodes, and for each episode \(t\), the environment defines a loss function, \(\ell_t(s,a) \in [0,1]\), which the learner incurs for taking action \(a\) in state \(s\). Deviating from bandits, the environment also defines a transition function, \(P_t(\cdot|s,a)\), which dictates the distribution of the next state, \(s'\), that the learner travels to after taking \(a\) in state \(s\) (this is where “Markov” comes from: the distribution of the next state depends only on the current state and the action taken there, not on the rest of the history). These transitions can only go from a state in one layer, \(h\), to a state in the next layer, \(h+1\), and we assume the final layer, \(H\), contains only one state, \(s_H\).

For each episode \(t\), the learner decides on a policy \(\pi_t\), a mapping \(\pi_t: \mathcal{S} \times \mathcal{A} \rightarrow [0,1]\) that gives the probability of taking action \(a\) when visiting state \(s\). The learner’s interaction protocol for episode \(t\) is as follows (a minimal simulation sketch follows the list):

  • Learner decides on \(\pi_t\) and starts at state \(s_0\).
  • For every \(h \in \{0, \ldots, H-1\}\), with corresponding visited state \(s_h\), the learner:
    1. selects \(a_h\) sampled from \(\pi_t(\cdot|s_h)\)
    2. incurs loss \(\ell_t(s_h,a_h)\)
    3. moves to \(s_{h+1}\) sampled from \(P_t(\cdot|s_h,a_h)\)
  • The learner receives feedback and updates their policy accordingly.
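
To make this protocol concrete, here is a minimal Python sketch of a single episode of the interaction above. Everything in it (the function name, the array layout, the placeholder inputs) is our own illustration and not notation from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_episode(policy, loss, P, H):
    """Simulate one episode of the layered-MDP protocol above.

    Illustrative (not the paper's) array layout:
      policy[h][s]  -- probability vector over actions at state s in layer h,
      loss[h][s][a] -- loss ell_t(s, a) in [0, 1],
      P[h][s][a]    -- probability vector over the states of layer h + 1.
    """
    s = 0                        # the single starting state s_0
    total_loss = 0.0
    trajectory = []
    for h in range(H):
        # Sample an action from the (possibly randomized) policy pi_t(. | s_h).
        a = rng.choice(len(policy[h][s]), p=policy[h][s])
        total_loss += loss[h][s][a]              # incur loss ell_t(s_h, a_h)
        trajectory.append((h, s, a))
        # Sample the next state from the transition P_t(. | s_h, a_h).
        s = rng.choice(len(P[h][s][a]), p=P[h][s][a])
    return total_loss, trajectory
```

The learner would then use the observed trajectory and its losses as the feedback in the last step, updating its policy for episode \(t+1\).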

As with any other RL problem, we define regret, which we wish to minimize. Before this, we must define value functions. We write: \[ V^\pi(s,\ell) = \mathbb{E}\left[\sum_{h'=h}^{H-1} \ell(s_{h'}, \pi(s_{h'})) | s_h = s\right] \] to denote the V-value function, which, intuitively, gives the total loss the learner expects to incur across the rest of the episode when starting from state \(s_h = s\) and following policy \(\pi\) (for example, for the last state, \(V^\pi(s_H, \ell) = 0\)). We also write: \[ Q^\pi(s,a,\ell) = \ell(s,a) + \mathbb{E}_{s' \sim P(\cdot|s,a)}[V^\pi(s', \ell)] \] to denote the Q-value function, which, intuitively, gives the expected total loss from taking action \(a\) in state \(s\) and then following policy \(\pi\) for the rest of the episode. Note how we can define the Q-value function in terms of the V-value function, by taking the immediate loss incurred and adding the expectation of the V-value function over the next state. We can go the other way for the V-value function by writing: \[ V^\pi(s,\ell) = \mathbb{E}_{a\sim \pi(\cdot|s)}[Q^\pi(s,a,\ell)] \] which takes the expectation of the Q-value function over the actions sampled from policy \(\pi\). These are what are sometimes referred to as the Bellman equations and are central to MDP analysis.
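
Because the states are arranged in layers, both value functions can be computed exactly with a single backward pass whenever the loss and transition functions are known. Below is a minimal tabular Python sketch of that recursion; the names and array shapes are our own illustration (reusing the layout from the episode sketch above), not notation from the paper.

```python
import numpy as np

def value_functions(policy, loss, P, H, n_states, n_actions):
    """Tabular Q^pi and V^pi via a backward pass over layers.

    For simplicity every layer has n_states states and n_actions actions;
    the array layout (loss[h][s][a], P[h][s][a] as a distribution over the
    next layer, policy[h][s]) is an illustrative choice, not the paper's.
    """
    V = [np.zeros(n_states) for _ in range(H + 1)]    # V at the terminal layer stays 0
    Q = [np.zeros((n_states, n_actions)) for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman equation: immediate loss + expected next-layer V.
                Q[h][s, a] = loss[h][s][a] + np.dot(P[h][s][a], V[h + 1])
            # V is the expectation of Q under the policy's action distribution.
            V[h][s] = np.dot(policy[h][s], Q[h][s])
    return Q, V
```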

With these value functions, we can define our regret as follows. Recall \(s_0\) is the starting state for every episode \(t\). \[ R_T := \sum_{t=1}^{T} V_t^{\pi_t}(s_0) - \min_{\pi} \sum_{t=1}^{T} V_t^{\pi}(s_0) \] This gives the difference between the total loss the learner incurs and the loss it would have incurred by following the single, fixed, loss-minimizing policy in hindsight, which we denote \(\pi^*\).

Pretty heavy notation! Yet this structure formalizes a fairly robust model for sequential decision-making under uncertainty and admits many interesting problem variants for us researchers to study.

Adversarial MDPs and Corruption

One of the more challenging MDP variants to study is the adversarial MDP. As with adversarial bandits, the environment can now choose the loss functions \(\ell_t\) arbitrarily for each episode (in contrast to losses being sampled from a fixed distribution across episodes). This makes learning policies (at least, deterministic ones) challenging, as the environment can choose these losses with knowledge of the learner’s algorithm (for this reason, the environment is often called the “adversary”). As in the adversarial bandit setting, this means the learner must adopt a randomized policy (i.e. sampling actions from a distribution decided by its algorithm) to achieve optimal regret.

Previous work (see Luo et al., 2021) solved this challenging setting with policy optimization. To handle the adversarial losses, they developed a new dilated exploration bonus for their Q-value function estimators, which encourages more exploration of states (particularly in deeper layers) and actions. Using this technique, they achieve a regret bound on the order of \(\mathcal{O}(\sqrt{T})\), which is optimal in \(T\).

Even more challenging is an adversarial MDP with corruption, which essentially adds adversarially chosen transition functions. While in a standard MDP the transition function \(P(\cdot|s,a)\) is fixed from episode to episode, in a corrupted MDP the transition functions can vary from episode to episode, so we index them by episode as \(P_t\) (as we already did in the protocol above).

The corrupted-MDP literature is extensive, but one particularly notable result is that with corrupted transitions, the regret can have a lower bound that depends exponentially on the episode length (see Lemma 1 of Tian et al., 2021). With this “limit” in mind, one reasonable goal in corrupted settings is to have the regret scale with the “level of corruption”.

To quantify this “level of corruption” rigorously, we use the \(C^P\) measure from Jin et al. (2023), defined below: \[ C^P := \min _{P^{\prime} \in \mathcal{P}} \sum_{t=1}^T \sum_{h=0}^{H-1} \max _{(s, a) \in S_h \times A}\left\|P_t(\cdot \mid s, a)-P^{\prime}(\cdot \mid s, a)\right\|_1. \] Intuitively, this sums, over every layer of every episode, the worst-case (over state-action pairs) \(\ell_1\) distance between \(P_t\) (the corrupted transition) and \(P'\) (the “uncorrupted” or “ground truth” transition that minimizes \(C^P\)). As such, this quantity can range from \(0\) up to \(\mathcal{O}(T)\) (growing linearly with \(T\)). This formalizes the goal we aim to achieve in a corrupted MDP, that is, a regret that scales with corruption:

Can we achieve a regret bound on the order of \(\mathcal{O}(\sqrt{T} + C^P)\)?
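
Before getting to the answer, here is a quick concrete reading of the \(C^P\) definition in code. The sketch below evaluates the inner sum for one fixed candidate ground truth \(P'\), using the same illustrative array layout as before; \(C^P\) itself is the minimum of this quantity over all valid \(P'\), which we make no attempt to compute here.

```python
import numpy as np

def corruption_for_candidate(P_per_episode, P_prime):
    """Inner sum of the C^P definition for one fixed candidate P'.

    P_per_episode[t][h][s][a] and P_prime[h][s][a] are next-state distributions
    (illustrative layout, not the paper's). C^P is the minimum of this sum
    over all valid candidates P'.
    """
    total = 0.0
    for P_t in P_per_episode:                      # episodes t = 1, ..., T
        for Ph_t, Ph_prime in zip(P_t, P_prime):   # layers h = 0, ..., H - 1
            # Worst-case L1 distance over state-action pairs in this layer.
            worst = 0.0
            for s in range(len(Ph_t)):
                for a in range(len(Ph_t[s])):
                    diff = np.abs(np.asarray(Ph_t[s][a]) - np.asarray(Ph_prime[s][a]))
                    worst = max(worst, diff.sum())
            total += worst
    return total
```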

The answer to this question is yes! In fact, Jin et al. (2023) developed a divergence-based algorithm that achieves this bound. The key principle they use is a transition confidence set. The exact details of the construction can be found in either our paper or theirs; I’ll state the main idea here for simplicity. Over a certain number of episodes, called an epoch, they maintain an empirical estimate of the transition distribution for each state-action pair. Then, using the known corruption level \(C^P\), they build a confidence set around this estimate (which shrinks as the state-action pair is visited more often), such that the true “uncorrupted” transition \(P'\) lies within it with high probability. Using these bounds, we can estimate the Q-value functions for each state-action pair effectively and make our algorithm corruption-robust.
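
To give a flavor of the bookkeeping involved, here is a rough Python sketch of an epoch’s empirical transition estimate with a corruption-enlarged confidence radius. The radius formula below is a simplified placeholder of our own; the actual confidence sets in Jin et al. (2023) and in our paper use more careful terms and constants.

```python
import numpy as np

class TransitionConfidenceSet:
    """Simplified sketch of an epoch-based transition confidence set.

    The L1 radius used here is an illustrative placeholder, NOT the exact
    radius from Jin et al. (2023) or our paper.
    """

    def __init__(self, n_states, n_actions, n_next, C_P, delta=0.01):
        self.counts = np.zeros((n_states, n_actions, n_next))
        self.C_P = C_P          # known total corruption (known-C^P setting)
        self.delta = delta      # failure probability

    def update(self, s, a, s_next):
        # Record an observed transition (s, a) -> s_next during this epoch.
        self.counts[s, a, s_next] += 1

    def empirical(self, s, a):
        # Empirical next-state distribution for the pair (s, a).
        n = max(self.counts[s, a].sum(), 1.0)
        return self.counts[s, a] / n

    def radius(self, s, a):
        # Placeholder: a concentration term that shrinks with visits, plus a
        # term accounting for the total corruption C^P.
        n = max(self.counts[s, a].sum(), 1.0)
        n_next = self.counts.shape[2]
        return np.sqrt(2 * n_next * np.log(1.0 / self.delta) / n) + self.C_P / n

    def contains(self, s, a, P_candidate):
        # A candidate next-state distribution is plausible if it is within the
        # L1 radius of the empirical estimate.
        gap = np.abs(np.asarray(P_candidate) - self.empirical(s, a)).sum()
        return gap <= self.radius(s, a)
```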

And while their algorithm relies on the learner knowing in advance the total amount of corruption \(C^P\), they further extend their results through a black-box reduction that turns their algorithm into one that operates without knowledge of \(C^P\). Their procedure is remarkable (it involves instantiating multiple copies of the algorithm with different “guesses” of \(C^P\), then running a bandit algorithm over them), and I encourage you to peruse their Section 4 for more information.
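
Purely to convey the flavor of that reduction (and not its actual details, which need careful importance weighting and learning-rate tuning), here is a caricature in Python: spawn base instances with geometrically spaced guesses of \(C^P\), and let an Exp3-style master decide which instance interacts with the environment each episode. The make_base and play_episode interfaces are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def master_over_guesses(make_base, T, eta=0.1):
    """Caricature of a bandit-over-instances reduction (NOT the exact scheme
    of Jin et al. (2023)): each base instance assumes corruption level 2^i,
    and an Exp3-style master picks which instance plays each episode.

    make_base(guess) and base.play_episode(t) are hypothetical interfaces
    returning a base algorithm and its realized loss for episode t.
    """
    guesses = [2 ** i for i in range(int(np.ceil(np.log2(max(T, 2)))) + 1)]
    bases = [make_base(g) for g in guesses]        # one instance per guess of C^P
    log_w = np.zeros(len(bases))
    for t in range(T):
        probs = np.exp(log_w - log_w.max())
        probs /= probs.sum()
        i = rng.choice(len(bases), p=probs)        # master picks an instance
        episode_loss = bases[i].play_episode(t)    # that instance acts this episode
        # Exp3-style importance-weighted update for the chosen instance.
        log_w[i] -= eta * episode_loss / probs[i]
    return bases
```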

Policy Optimization in Corrupted MDPs and \(\texttt{CR-UOB-REPS}\)

For our contribution, we solve the corrupted MDP problem using policy optimization (PO). In addition to the popularity and empirical success mentioned before, PO offers many practical implementation advantages, such as scaling efficiently to larger state spaces and being less computationally complex and expensive than occupancy-measure-based algorithms such as the one in Jin et al. (2023).

However, due to policy optimization’s local-search nature, PO algorithms often fail to converge to globally optimal solutions without additional (and often unrealistic) global-exploration assumptions. These constraints were alleviated in part by the aforementioned bonuses introduced in Luo et al., 2021, which, when included in the algorithm, facilitate global exploration by making less-visited states and actions look more attractive to the policy update. This also allows their algorithm, a version of what’s called \(\texttt{UOB-REPS}\), to be robust to adversarial losses, advancing the state of the art in policy optimization.

Yet once again, corrupted transitions make things more difficult. Our primary contribution is extending the policy optimization algorithm \(\texttt{UOB-REPS}\) to be corruption-robust, so that it can handle these more extreme environments with the computational efficiency afforded by PO. We call the result Corruption-Robust Upper Occupancy Bound Relative Entropy Policy Search, or just \(\texttt{CR-UOB-REPS}\).

We make two main modifications to \(\texttt{UOB-REPS}\) that allow it to handle corrupted transitions (a schematic of the resulting policy update follows this list).

  1. We incorporate the same enlarged transition confidence set used in Jin et. al, 2023, which contains the ground-truth transition \(P'\) with high probability.
  2. We fashion new corruption bonuses (\(b^C, B^C\) in our paper) that, like the preceding bonuses that made the algorithm robust to adversarial losses, encourage even more exploration, making \(\texttt{CR-UOB-REPS}\) robust to adversarial transitions as well.
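
Schematically (and glossing over details spelled out in the paper), the policy update in this family of policy optimization algorithms takes a multiplicative-weights form in which the bonuses are subtracted from the Q-value estimates: \[ \pi_{t+1}(a \mid s) \propto \pi_t(a \mid s)\, \exp\left(-\eta\left(\widehat{Q}_t(s, a) - B_t(s, a)\right)\right), \] where \(\eta\) is a learning rate. Subtracting the bonuses makes under-explored states and actions look less costly, which is what drives the extra exploration; the same combination \(\widehat{Q}_t - B_t\) reappears in the regret decomposition below.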

And just for completeness, I’ll state our main theorem, which shows that \(\texttt{CR-UOB-REPS}\) indeed achieves the desired \(\mathcal{O}(\sqrt{T} + C^P)\) regret bound in a corrupted MDP:

\(\textbf{Theorem 1: } \texttt{CR-UOB-REPS}\) gives the following regret bound with probability at least \(1-\tilde{O}(\delta):\) \[ R_T \leq \tilde{O}\left( H^4 + |A||S|H^6 + |S|H^2\sqrt{|A|T} + |S||A|H^2C^P\right) \]

And finally, in our paper we show that, using the same black-box reduction technique, we can turn \(\texttt{CR-UOB-REPS}\) into an algorithm that operates without knowledge of the total corruption \(C^P\) and still achieves the same order of regret (see Section 6).

Proofs…

I’ll be honest, I only want to write this section just because our decomposition of the regret looks rather cool. To set it up, I’ll introduce the performance difference lemma, which essentially just allows us to rewrite our regret in the following form: \[ R_T = \sum_{t=1}^T \sum_{h=0}^{H-1} \mathbb{E}_{s_h \sim \pi^{\star}}\left[\left\langle\pi_t(\cdot \mid s_h)-\pi^\star (\cdot \mid s_h), Q_t(s_h, \cdot)-B_t(s_h, \cdot)\right\rangle\right] \] where \(\pi^\star\) here is the fixed optimal policy (in hindsight). Using this form, we decompose our regret into the following beast: \[ \begin{align} R_T & =\underbrace{\sum_{t=1}^T \sum_{h=0}^{H-1} \mathbb{E}_{s_h \sim \pi^{\star}}\left[\left\langle\pi_t(\cdot \mid s_h), Q_t (s_h, \cdot)-Q^P_t(s_h, \cdot)\right\rangle\right]}_{\text {ERROR}_1}\\ &+ \underbrace{\sum_{t=1}^T \sum_{h=0}^{H-1} \mathbb{E}_{s_h \sim \pi^{\star}}\left[\left\langle\pi_t (\cdot \mid s_h), Q_t^P(s_h, \cdot) - \widehat{Q}_t(s_h, \cdot)\right\rangle\right]}_{\text{BIAS}_1} \\ &+ \underbrace{\sum_{t=1}^T \sum_{h=0}^{H-1} \mathbb{E}_{s_h \sim \pi^{\star}}\left[\left\langle\pi^{\star}(\cdot \mid s_h), \widehat{Q}_t(s_h, \cdot)-Q^P_t(s_h, \cdot)\right\rangle\right]}_{\text {BIAS}_2} \\ &+ \underbrace{\sum_{t=1}^T \sum_{h=0}^{H-1} \mathbb{E}_{s_h \sim \pi^{\star}}\left[\left\langle\pi^{\star}(\cdot \mid s_h), Q^P_t(s_h, \cdot)-Q_t(s_h, \cdot)\right\rangle\right]}_{\text{ERROR}_2} \\ & +\underbrace{\sum_{t=1}^T \sum_{h=0}^{H-1} \mathbb{E}_{s_h \sim \pi^{\star}}\left[\left\langle\pi_t(\cdot \mid s_h)-\pi^{\star}(\cdot \mid s_h), \widehat{Q}_t(s_h, \cdot)-B_t(s_h, \cdot)\right\rangle \right]}_{\text {REG-TERM}} \end{align} \]

where \(Q_t\) represents the Q-value function for \(\pi_t\) under the actual corrupted transitions from the environment, while \(Q_t^P\) denotes the Q-value function for \(\pi_t\) under the uncorrupted ground-truth transition (i.e. \(P'\)).

Decomposing our regret in this manner is nice: the terms BIAS\(_1\), BIAS\(_2\), and REG-TERM have already been extensively studied in Luo et al., 2021, and because these terms also operate under the uncorrupted ground-truth transitions, we can use similar techniques to bound them (taking care of some subtleties introduced by the new enlarged transition confidence set and the additional bonuses).

The remaining terms, ERROR\(_1\) and ERROR\(_2\), intuitively capture the regret incurred due to the corrupted transitions (note the inclusion of both \(Q^P_t\) and \(Q_t\)). Using a backwards-induction technique, we show that these terms are bounded by \(O(|A||S|H^2C^P)\) via the following lemma:

\(\textbf{Lemma 5.1: }\) For any sequence of policies \(\pi_1,...,\pi_T\), we have \[ \sum_{t=1}^T \sum_{h=0}^{H-1} \mathbb{E}_{s_h \sim \pi_t}\left[\left\langle\pi_t(\cdot \mid s_h), Q_t (s_h, \cdot)-Q^P_t(s_h, \cdot)\right\rangle\right]\leq |A||S|H^2C^P \]

Combining them all - with a little bit more work :) - gives us the final regret bound stated in Theorem 1.

Acknowledgements

This was my first project in Markov Decision Processes, and if there’s anything I learned, it’s that 1) MDP notation is absolutely horrendous, and 2) MDPs are hard. As someone who often has difficulty seeing the future potential in endeavors like these, I would be lying if I didn’t say I was pessimistic at times during the six (ish?) months I’ve been on this project. But I also learned 3) math is undoubtedly collaborative. The people I’d like to recognize in particular:

  1. Steve (Khang Nguyen), who during spring quarter, became an absolute MDP legend and is someone whom I respect for digging deep into material and putting in the elbow grease. We will miss you here at UCLA, and best of luck at USC in the future.
  2. William, whose unrelenting delusion and self-proclaimed humility kept this project on life support. He sacrificed being the best TA in the math department to cultivate the best research group at UCLA, BruinML.
  3. @wjaaaaaat, who came onto this project near the end, but was no doubt a huge boost of optimism in the face of many struggles and setbacks during the process of writing this paper.
  4. Professors Haipeng Luo, Junyan Lu, and Chenyu Wei for their insightful feedback and consultation for this project.

And finally, all the rest of the people at BruinML. o7